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Introduction 

Let (Si, Tj) ieN . be a sequence of i.i.d. random variables with values in W x 
and with common law h s ®Ht- Let (ei)^* be a sequence of i.i.d. standard normal 
random variables on W 1 , independent of the preceding sequence. We consider in 
the sequel the inverse problem which consists in estimating the law fi,s given the 
finite sequence (Yi, Tj)^^ where 

Y r .= f(S t ,T t )+ae t , (1) 

and where / : l p xf" — ► M n is a known smooth function, which can be in particular 
nonlinear in the first variable. The asymptotic is taken in N, and n remains 
fixed. It is assumed that a is some known non-negative variance parameter. We 
emphasise the fact that in the triplet (Yi, T, Si), we observe only the couple (Y i} T), 
and we are interested in the estimation of the joint law of the unobserved random 
variables 5j. 

In the sequel, C(Z) denotes the law of the random variable Z. For example, one 
has C(Si, Tj) = us ® Ht- 111 the same spirit, C(Z\ \ Z 2 ) denotes the conditional law 
of Z\ given Z 2 - Finally, we denote by V(M. d ) the convex set of probability measures 
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on M. d equipped with its Borel cr-field and with the C;,(R d ,R) dual topology. We 
will sometimes denote S, T, Y for any random variable with law fis = C(Si), /j,t = 
£(Ti), and \iy = C{Y\) respectively. Finally we will denote by yi = (y^i, ■ ■ ■ , yi, n ), 
ti = (ti,i, ■ ■ ■ , ti,n) an d Si = (sj i, . . . , Si tP ) any realisation of the random variables 
Yi, Ti and Si respectively. 

Before starting the mathematical analysis of the problem, let us give briefly 
some explanations regarding the notations and the motivations. The random vari- 
ables Yi = (Yi t i, . . • , y%,ri) represents the values measured for individual number i 
at times T = (Tj jl; . . . ,T i ri ). The random variable Si stands for the individual 
parameter and the random variable a £j models the (homoscedastic) random noise 
which is added to the possibly nonlinear true value f{Si,T^). This kind of data 
is known as repeated measurements, or called longitudinal since each individual 
(from % = 1 to % = N) is observed rii := n times and provides a whole vector 
of consecutive observations Yi = (Y^i, . . . ,Yi >n ) performed at the corresponding 
individual times (Tj 1, . . . ,T i<n ) = Tj. Since T is a sequence of measuring times, 
one can assume for simplicity that the law \xt is a tensor product of uniform laws 
on disjoint consecutive compact intervals of the real half line R+. One can think 
about [It and n as the design of the experiment, whereas / and £(Si \ Ti) and C(ei) 
correspond to the model chosen for the inverse problem, relating the individual 
observation Yi to the individual parameter Si and to the individual measuring 
times Tj. Usually in applications, / is of the form 

f(s, Ti) = (q s (T iA ), . . . , q t (T it n)), (2) 

where for any s in MP, q s : R — > R is a smooth function depending smoothly on 
the parameter s, for example a linear combination of time dependent exponentials 
with coefficients related to s. Function q s represents in such a scheme the true 
evolution in time of the phenomenon of interest for an individual of parameter s. 

Practical applications of models like (pQ) are numerous in signal transmission, 
in tomography, in econometrics, in geophysics, etc, cf. [22]. Let us give briefly 
a concrete example in Biology. We consider the decay of the concentration of 
a medicine in human blood. One has p = 2 and q(A,a)(t) = Aexp(-at) in (J2J), 
where A stands for the quantity of medicine in the blood at time 0, and where a 
stands for the rate at which the medicine is eliminated. At the beginning of the 
experiment, the medicine is given to N independent patients. For patient number 
i, n measurements (^j)i^j^n °f t ne concentration of the medicine in blood are 
made, at times (tij)i^j^n- One of the simplest model used in this context is 

Y i,j = QiAua^hj) + ae i,h with « = 1, ■ ■ ■ , JV and j = 1, . . . , n. 

If we state Si := (A iy ctj), the random variables S±, . . . , Sjy are i.i.d. and correspond 
to the biological specificity of each patient. We are interested in the estimation 
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of the distribution fi$ of the common law of these random variables (population 
pharmacokinetics). Deconvolution methods are useless since the required condition 
n — > +00 is unrealistic. The number of observations n for each individual remains 
small, a few units in practice. Our framework where the asymptotic is taken on 
the number of individuals N is the only mean to perform the estimation of the 
"population law" 

A stochastic inverse problem is an inverse problem for which the subject of 
the inversion is a probability measure, like in The related theoretical and 
applied literature is huge, with many connected components. It contains in partic- 
ular deconvolution problems, mixtures models, (non)linear mixed effects models, 
(non)linear filtering problems, etc. Even a common keyword or phrase like our 
"stochastic inverse problems" is most of the time missing and/or ignored. There- 
fore, it is quite hard to give a descent state of the art, but a bit less difficult is to 
show various natures of a particular subclass of problems. 

We emphasise the fact that is not a standard regression problem since / 
is known whereas the Si and their law are unknown. Moreover, our problem 
in not of Ibragimov and Hasminskii type since the Si are not observed. Notice 
that when n is very large deconvolution techniques can give an estimation of each 
Si. The approach developed recently in [T3j in useless for our problem since we 
consider an asymptotic in N and not in n. 

One of the common difficulties of stochastic inverse problems like ([IJ lies in 
the fact that they are ill-posed. The inverse of the underlying operator is not 
continuous in general, so that a small perturbation of the data may induce a large 
change for the common law of the unobserved random variable. If the unknown 
was a function in a Hilbert space instead of a probability density function, one 
could try a singular value decomposition (SVD), following for example Cavalier, 
Golubev, Picard and Tsybakov in [5J. 

Several authors have investigated nonparametric maximum likelihood estima- 
tion for stochastic inverse problems, and related Expectation Maximisation (EM, 
cf. jH]) algorithms. In the context of mixtures, Lindsay showed in [TH Ho] by using 
elementary convex analysis that the fully nonparametric maximum likelihood is 
achieved by a discrete probability measure with finite number of atoms related to 
the sample size, connecting by this way this kind of problems with convex anal- 
ysis algorithms (Simplex algorithm, Fedorov methods, etc). One can find some 
developments in [23 H3 E3 12] • The consistency of such estimators was established 
at least by Pfanzagl in |23j . In [23], Schumitzky gave an EM like algorithm for 
Lindsay's estimator. In another direction, Eggermont and Lariccia have developed 
smoothing techniques for problems involving Fredholm integral operators, cf. [0] 
and references therein. 

To sum up, our aim in this paper is to estimate /is, the common law of the unob- 
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served i.i.d. random variables Si in when belongs to some class Ts C V(MP). 
The rest of the paper is divided as follows. Section Q introduces a nonparametric 
Likelihood Estimator (NPML) for and is devoted to establish its consistency up 
to identifiability. Sectional presents finite dimensional and algorithmic approaches 
to approximate the NPML. Finally, in Section |SJ various related questions are 
discussed. 

1 An NPML and its consistency 

Conditionally on the Si, the Yj are independent but not identically distributed, 
due to the dependence over Tj. However, since the individual observed datum 
consists in X^ := (Y,Tj), it is quite natural to see Si as the unique unobserved 
random variable in the triplet (Y^S^Tj). The law C(Xi) = £(Yj,Tj) is nothing 
else but 



where u (y,t) = x" and where 7 CTjn is the Gaussian probability density function on 
R n given by 7 ff ,„(u) := (2na 2 )- n '/ 2 exp(-\\x\\ 2 2 /2a 2 ). Similarly, the law C(Y { ) of Y { 
is the following mixture 



where the mixing law is /is ® fir and where the mixed family is the following 
/-deformed Gaussian location family 



K,n(« - f(s,t)) where (s,t) eWx W 1 } = 7^ * {S m) where (s,t) G W x W 1 }. 



Assume now that the law /i T has a density ip with respect to the Lebesgue measure 
on R™ . Then, one has that the law C{Xj) = C(Yi, Tj) is absolutely continuous with 
respect to the Lebesgue measure on R n x with probability density function 
K(ns) given by 



When us has density <p with respect to Lebesgue's measure on IR P , we will denote 
K(<£>) instead of K(/is), viewing by this way K as a linear operator over probability 
density functions. 






(3) 
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Here again, the law C(Xi) = C(Yi, Tj) is a mixture, with mixing law /is and mixed 
family 

{i){t)la,n{* ~ /(*,*)) With (S,t) ER P X W 1 }. 

Notice that K.(p,g)(y, t) is always positive, and thus, logK(/x s ) always makes sense. 
The log-likelihood can be expressed by mean of the unknown law fig as follows 

M//5) :=P7vlogK(^ 5 ), (4) 

where P^ is the empirical measure of the sample (X^k^h defined by 



! A 



1 N 



N 

i=l 



Notice that we have used above the standard notation F^F to denote the expec- 
tation of function F with respect to probability law Pjy. When / is of the form 
(J2J), the log-likelihood defined above in (jlj) reads 

Ljv(/x s ) = -r= Y] log / exp ( -— y2(Y id - q s {Tij)) 2 J dfi s (s) + C 

l N n , v 

= -^7 E E lo § / ex p - 9.( r «)) a <w*) + c 

iV i=1 j=1 Jsmp \ za J 



where 



1 N 

C := ~ log(27T ( r 2 ) + - ^ logV(Ti,i, • • • , T i>T 



N 

i=i 

The quantity C does not have any effect on the arg-maximum of the log-likelihood 
functional L N . In particular, the density ip of ht does not play a direct role in the 
NPML © below since one can rewrite L^r as follows 

L N (n s ) = F N log V + Pjv log K # f> 5 ) , 

where 

(K # (/i 5 ))(?/,t) := / j a>n (y- f(s,t))dn s (s). 

On any set JF, the arg-maximum of Ljy is equal to the arg-maximum of defined 
by 

L# (//«?) :=V N logK*(ji s ). 

The functional does not depend on /it directly, but only implicitly via the sam- 
ple Ti, . . . , Tjv throughout F N . However, the law [At plays a role in identifiability, 
and the good choice of this law is always a crucial issue. 
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Definition 1.1 (Identif iability) . We say that the mixture model ((H) is identifiable 
if and only if K is injective, as a map from Ts to V(M. n ). Namely, for any couple 
(/!, v) G Ts x Ts with v //, one has K(/i) 7^ K(z/) in V(M. n ). Similarly, we say 
that G Ts is identifiable in (TJQ) if and only if K(V) 7^ K(/i) in V(M. n ) for any 
z/ G with z/ 7^ yU. 

Clearly, the model is identifiable if and only if every element of Ts is identifiable. 
Identifiability is essential for any estimation issue of the true mixing law fis- This 
condition is quite difficult to check in great generality However, one can find some 
clues for example in [6, and references therein. In practice, and when possible, 
identifiability must be checked for the particular model considered, and is deeply 
related to the properties of function / and to the distribution ht of the observation 
times. We are now able to state the following Theorem. 

Theorem 1.2 (Consistency of NPML). Assume that Ts C V(W P ) is a compact 
convex subset of a linear space, that the model is identifiable, that C(T) = ip(t) dt, 
that fis £ Fs, and that for almost all (y, t) G R n xR™ , the map K.(»)(y, t) : JF S — >• R 
is continuous. Then, the NPML estimator p^N given by 

p^}f : = arg max L N (fj) (6) 

is well defined, unique, and converges almost surely toward (is when N goes to 
+00. 

Proof. The random map L^r is a.s. continuous from JF S to R since the map 
K(«)(y, t) : Ts — > R is continuous for any (y, t) G R n x R™ . By linearity and iden- 
tifiability of K and strict concavity of the logarithm, the map L^v is a.s. strictly 
concave. Thus, it achieves a.s. a unique sup over the compact convex set Ts- The 
existence and unicity of the estimator jXs^N is therefore proved. Finally, thanks 
to our choice of settings, the desired consistency result follows from pH| Theorem 
3.4] and [23 Section 5], since the required hypotheses are fulfilled: 

• Condition 1. Ts is a compact Hausdorff space, and a subset of a linear 
space. 

• Condition 2. For almost all (?/,, t^i^i^v, the map YliLi {yi, U) is con- 
tinuous on Ts for the topology of Ts- 

• Condition 3. For almost all (y, t) G R n x R™ , the map K(«)(?/, t) is concave 
on T s - 

□ 

Remark 1.3. Let us give various remarks about Theorem 11.21 and its extensions. 
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1. Identifiability. Following again |23j . one can relax the identifiability of the 
model to the identifiability of /is, but it is not really useful in practice since 
lis is unknown! For any x := (y, t) G M n x M n , let us denote by k x : M p — > M + 
the function k x (s) := 7 n ,o-(z/ — /( s >^))- Let T be the biggest open subset of 
M n such that ip > over T. Then, identifiability of the model corresponds 
to a condition on the set of functions C := {k x : M p — > M + with i6R"x T} 
appearing in the mixture Namely, it must separate the elements of 

In other words, when / is smooth, the C class must be large enough to fully 
characterise any element of Ts by duality as a set of test functions for a 
distribution of order zero in the sense of L. Schwartz distributions Theory. 
Such a necessary and sufficient separation condition relies on both / and T 
and can, depending on the particular choice of J 7 , be weaker than the full 
injectivity of / in the first variable when the second runs over T. Notice that 
the smoothness of / together with its injectivity in the first variable induces 
in general a "degree of freedom" requirement on (n,p). If Ts C D'(K) for 
some compact subset K of MP, then C separates the elements of [/,§ as soon as 
the vector space spanned by C is dense in C°°(K) for the uniform topology. 

2. Heteroscedasticity. At least when the elements of J-g are compactly sup- 
ported, Theorem 11.21 remains true for a class of heteroscedastic models of the 
form 

Yi = f(S h Ti) + ae r + g(S h T t ) ■ e h (7) 

where a > is known, where g : MP x M n — ► MP is a smooth function and 
where the dot mark "•" denotes the component-wise vectors multiplication. 
One can also incorporate a matrix between g and £j. Notice that condition 
g ^ ensures that the variance of the conditional law is bounded below by 
a 2 and thus, the mixture makes sense. The mixed family is a location-scale 
(/, g)-deformed Gaussian family: 

{l^ +9 {s,tW'\n{* ~ /M)) where (s,t) e M p x M n }. 

In concrete applications, it is quite usual to state that g and / are co-linear 
in the heteroscedastic model above, say g = a' f, making the noise roughly 
proportional to the measured value. 

3. Non Gaussian noise. Theorem 11.21 remains true when the Gaussian law 
of the noise £j in is replaced by an absolutely continuous law with re- 
spect to the Lebesgue measure on M n . The related location mixed family 
is not Gaussian in that case, but this does not block the derivation of the 
consistency of the NPML. 
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4. Non homogeneity via censure. Let (nj)j g N* be a sequence of i.i.d. ran- 
dom variables independent of (Si, Tj)j 6 N*, with values in the set M n of subsets 
of {1, . . . ,n}, and with common law p K : = P(ni = k) > for any k G N n . 
Assume that for each i, one has access only to Z,- L := (Yij, j G n*) instead of 
the whole vector of measurements Fj := (Y^i, . . . , Y^ n ) itself. Then, the new 
inverse problem corresponds to the new sample 

((Z 1 , T 1 , m), . . . , (Z N , T N , njv)) 

which is the censored version of the original sample with unobserved Si values 

Si, Ti), . . . , (Y N , Si, T N )). 

The problem is that the Zj are not in the same space, but are still inde- 
pendent. Our goal then is to rewrite the problem in a i.i.d framework. One 
method consists in extending the data space to the larger direct sum space 
E := ® K eN„E K , where E K is a copy of M' K ' corresponding to the components 
present in k, where \k\ := #/c. It is then easy to write down the law of 
(Zi,Ti, ni). Such a model is quite heavy to write down but gives rise to a 
simple extended log-likelihood: 

L N (ns) := Ptv logp K + F N log^ + P^v log K K (ns), 
where for any ji G J~s 

K K (fj,)(z,t,K) := / la,\ K \(z-n K (f(s,t)))dfx(s), 
JseRp 

where n K is the projection of E on E K and where the empirical measure Pat 
is now 

1 N 

i=l 

The PArlogj9 K + F^logip part of the log-likelihood does not depend on [1$, 
and thus, it does not influence the arg-maximum of the log-likelihood and 
can be safely removed. For each i, the Tjj involved in the log-likelihood are 
those with j G ni. Finally, one can notice that such type of independent 
censoring does not correspond to all realistic censure, since in practice, the 
nj can depend on the Y { it self via for example 

{hY iA > T }, ■ ■ ■ >I{y;,„>t}) 
where r is a detection threshold. 
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5. Continuity of the operator. The continuity assumption on K relies in 
general on function /, on the nature of Ts-, and on the law of the noise e*, 
which is Gaussian and homoscedastic here. Some concrete examples of T 
are given below. 

6. Full extension. Mixing all the previous extensions is delicate. 
Example 1.4. Consider for instance the set Ts C V{M. P ) defined by 

T s :=T™' A ■= {<p(s)ds; where <p e C^([0,M]) and ||^|| L1 = 1, HIV^IH^ < A}, 

(8) 

where K is a fixed compact subset of W and where M, A are fixed non negative 
real numbers. Equipped with the L°° topology, this set is a compact convex subset 
of a linear space, as required by Theorem 11.21 Since the underlying mixture model 
is a "Gaussian position" one, we get for any couple (<pi,<f2) £ J~s x J~s an d an Y 
(y, t) e R n x R n 

\KM(y,t)-K( V2 )(y,t)\ < ||^i - HUHL(2™ 2 )- n/2 , 

which gives the L°° continuity of K(«)(?/, t) for any couple (y,t) 6K"x M. n . Since 
we deal with a "Gaussian position model" (homoscedasticity), the operator norm 
does not depend on (y,t) and function / plays not role. The L°° a.s. consistency 
up to identifiability of the NPML follows then from Theorem 11.21 

Example 1.5. Consider the set Qs C V(M. P ) defined by 

Q s ■= Qg' a := {ip{s) ds; where (p E Y[ a {K) with ||(^|| L i = 1 and ^ ^4}, 

where K is a fixed compact subset of MP, A is a fixed non negative real number 
and H a (K) is the Sobolev space over the compact K. Provided that a > \ — -, 
Rellich-Sobolev embedding Theorem yields that JF S is a compact convex subset of 
a linear space for the L 2 topology, cf. jUHHj, as required by Theorem 11.21 Since 
the underlying mixture model is a "Gaussian position" one, we get for any couple 
Oi, Pi) efs^Fs and any (y, t) e W n x R n 

|K( W )(y,t)-K(^)(»,t)| < ||^i-^|| 2 ||^|| 0O (47ra 2 )-"/ 4 , 

which gives the L 2 continuity of K(«)(y, t) for any couple (y,t) € M. n x M. n . Since 
we deal with a "Gaussian position model" (homoscedasticity), the operator norm 
does not depend on (y,t) and function / plays not role. The L 2 a.s. consistency 
up to identifiability of the NPML follows then from Theorem 11.21 
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2 Algorithms for the NPML 

2.1 Finite dimensional approximation 

The first step towards a practical implementation is to transform the maximum 
fts^N of the log-likelihood Ljy over the whole infinite dimensional class J-'g into a 
maximum ^s,N,m over a finite dimensional convex subset Fg^, where (J r s,rn)m€N* 
is an exhaustive sequence of subsets of J-'s, i-e. adh(U me N*^ r m ) = 3~ '■ 
Theorem 2.1. Assume that Tg is a metric space. Let (J r s,m)meN* be an exhaustive 
sequence of finite dimensional closed convex subsets of Fg. Under the assumptions 
of Theorem EUj and for any fixed sample of size N , the approximated NPML 
estimator jlgj^n given by 

fis~Nsn ■= arg max L n ((jl). (9) 

is well defined, unique, and converges toward the NPML p^N when m goes to +oo. 

Proof. We proceed at fixed N. Since !Fg )m is a compact convex subset, the approx- 
imated NPML estimator ftgj^n exists, as it was the case for the NPML estimator 
fis~N in Theorem 11.21 Let us now establish the convergence. By the definition of 
jIs^N~m and ffg^q one has that 

Ljv(/iS,AT,m) ^ L N (fig jN ). 

In the other hand, there exists a sequence (/i m ) me N« converging towards fig^ in 
JF 5 and such that /i m G Tg^ m for any m G N*. Hence, lower semi continuity of L N 
induces that, for any e > 0, there exists m £ G N* such that for any m ^ m £ , 

^n(^s,n) — £ ^ Ljv(A*m)- 
But by definition of fbgj^m we have 

As a result, the following bound holds for any e > and any m > m e 

L n {P^n) ~ £ < ^Nif^gN^n) < ^n(P^n)- (10) 

If /i* G JF 5 is an adherence value of the sequence (jJ>s,N,m)meVi*i corresponding 
to the limit point of a subsequence (fAg^m^ke®*, then //* = /t^. Namely, if 
it was not the case, then |T0|) will implies that (^(f!gN~m k ))km* converges toward 
Lat(/As^v), and thus that Lat(//*) = L7v(/Asviv), which contradicts the unicity of /ts^v 
as a maximum of Ln over Hence, fTg^ is the unique adherence value of the 
sequence (figj^n)rn<=N* , and the compacity of J-'g yields finally that (jTgJy^n)m£N* 
converges towards ftgN, which is exactly the desired result. □ 

Remark 2.2. The rate of convergence of (ps^rrdm&i* towards fhgjy when m goes 
to +00 depends on the regularity of J-g )m and L N . 
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2.2 A Gradient algorithm for log-likelihood maximisation 

Since for any m G Ts,m an d any couple (/x, v) in Ts,m x Fs,mi 



Ljv(/i) - Ltv(z/) = Pat log 



k(aQ 



the sieves log-likelihood estimator fisj^m defined in (jHJ) can be viewed as the solu- 
tion of the following optimisation issue: 



find fts~^Z such that V/x G F s ,m, P;v log 



K(/x) 



^ 0. 



(11) 



N.m) 



By using the concavity of the objective function, Pfanzagl has proved in [21] that 
one may switch, in the definition of the estimator in (fTTJl . from the log function 
to any other function L : M*. — > R, provided that it is concave, strictly increasing, 
with L(l) = 0. 



find flsj^m such that V/x G Ts, m -, ^nL 



K(/x) 



K(/^ 



TV, m J 



^ 0. 



(12) 



As a result defining the estimator for a particular L is enough to get inequality ()12j) 
for all "contrast" function L satisfying the previous assumptions. In particular, 
the estimator jTsJ^m can be obtained for the special choice L(t) — t — 1, which 
corresponds exactly to the definition of the EM algorithm iteration. Hence, max- 
imising the estimator can be practically computed via the EM algorithm, while 
Theorem 12. II still applies, proving consistency of the estimator. This invariance in 
L relies on the "concavity" of the model, as explained in 



3 Discussion 

3.1 Heuristics for the NPML in Theorem ITT21 

As usual for maximum log-likelihood, the strong law of large numbers yields 
that (Pjv) tvgn* converges a.s. toward K(/x 5 ) in V(M, n ). In other words, C(Y) = 
K(/xs). Consequently, for any /x G Fs, (J-> N (fi)) NeN * converges toward 

M/x) := -Ent(K(/x 5 ) | K(/x)) + H(K(/x 5 )), 

where Ent(K(/x s ) | K(/x)) = J (logK(/is)-logK(/x))K(/i,s) is the Kullback-Leibler 
relative entropy of K(/x s ) with respect to K(/i) and where H(K(/x 5 )) = L QO (/X5) 
is the Shannon entropy of K(/xs). In other words, the log-likelihood random func- 
tional Iin converges toward the deterministic functional when N goes to +oo. 
This deterministic limit is the relative entropy functional Ent(« | K(/xs)), up to 
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the additive constant H(K(/is)) which does not play any role for the arg-maximum 
problem. Since K is injective (identifiability), is strictly concave with unique 
maximum achieved at point [is- The NPML estimator replaces the asymptotic 
arg-maximum fi$ with the finite N arg-maximum jZsjj. The non-asymptotic log- 
likelihood L^v is not a relative entropy, but remains strictly concave. The EM 
algorithm /ijv,fc+i = F N (fi Ntk ) consists in approximating fisjv by finding an en- 
tropic lower bound functional for which touches at the current step fiN,k- 
The EM algorithm in this context can be seen also as a gradient like algorithm 
A*iv,fc+i = ^N,k + GAr(/ijv jfc ) for the concave functional L N , where G^ is the Gateau 
directional derivative of Ljy- It turns out that this gradient like approach appears 
as a fixed point iteration fiN,k+i = F N (fi N)k ) where F N = Gjv + Id. The fixed 
point problem Fjy(/i) = /i corresponds exactly to Bayes rule where the unknown 
Us is replaced by the current step \i and where C{Y) = K(fis) is replaced by the 
first marginal of Pjv- Here again, (Fn)^^* converges point-wise toward which 
admits fis as unique fixed point. One of the main feature of EM is the monotonic- 
ity of the objective function L^r along the algorithm. The drawback with such a 
basic EM approach for nonparametric NPML is the fact that the support is non 
increasing along the algorithm. 

3.2 Destruction the log-likelihood concavity for mixtures models 

The log-likelihood of mixtures models is a concave functional of the unknown 
mixing probability measure. However, this structure is very sensitive. Lindsay has 
showed in jT3] by simply using Minkowski-Caratheodory Theorem that the fully 
nonparametric NPML for mixtures models like (P) is achieved by an atomic prob- 
ability measure with at most iV + 1 atoms. By fully nonparametric, we mean that 
JF S = V(M. P ). This observation is enough robust to remain valid for heteroscedas- 
tic models as in Remark 11.31 Unfortunately, the parametrisation of such discrete 
probability measures in terms of weights and support points destroys the concav- 
ity of the log-likelihood objective function Ljy- This lack of concavity cannot be 
fixed by the introduction of a stochastic ordering on the set of discrete probability 
measures with at most N + 1 atoms. 

3.3 Semi-parametric estimation 

The convexity structure of the NPML problem is destroyed by the incorporation 
of fixed effects estimation. This is typically the case for mixed-effects models 
where a linear model structure is imposed to fis and where a is unknown in 
In such cases, the global log-likelihood, seen as a functional of both random and 
fixed effects, is not concave and has potentially many local maxima. The semi- 
parametric approach developed in [21] is useless since we do not have a consistent 
estimator of the fixed effects regardless of the random effect. 
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Recall that a typical mixed effects model corresponds to some particular struc- 
ture (a linear model in general) on the Si in (JIJ. Namely, Si = QVi + T]i, where 
Vi is an observed vector of per-individual co- variables (sex, weight, etc), where 
is an unknown matrix parameter giving the trend (fixed effect), and where rji 
is the random effect of unobserved data. In such a model, the (V^) ig N* and the 
{Vi)i£N* are i-i-d., and the {Tj, Vi, fy, £j, where i G N*} are mutually independent 
random variables. The goal is then to estimate the matrix and the common 
law fi v of the (r/j)i 6 N*- Such models are used for example in Biology to let the 
measurements take into account the known specificity of each individual while 
conducting a survey. The pattern, which is determined by physiological rules is 
given by the function /, while the specificity of each individual is modelled by 
the random variables (S^i^at. If we write Si = QVi + m + 77^ where m is a 
fixed parameter to be estimated and where 77^ is a centred random effect, one can 
first estimate the law of the centred random effect 7/ and then estimate the fixed 
effects and m. However, this approach must be adapted when the coefficient a 
in (JTJ is not known, since it appears in that case as a new fixed effect to be esti- 
mated. We believe that a semi-parametric extension of our method can be made, 
providing an estimation of (Q,fi v ). The approach presented in does not help 
since we do not have a consistent estimator for the fixed effects. Despite the fact 
that numerous nonparametric techniques were developed for mixtures models, the 
widely used approach in applications of nonlinear mixed effects models is quite 
rough and consists in a fully parametric estimation of the first two moments of 
the law fi v of the random effect rj, where it is arbitrarily assumed that this law is 
normal or log-normal, cf. [201 IS] and [Zj for example. Even if they speed up the 
effective computations, such fully parametric approaches are not satisfactory since 
the consequences it terms of decision are highly sensitive to the arbitrarily chosen 
structure for the random effect law (not robust). 

3.4 No rates 

To obtain rates of convergence for the maximum likelihood estimator, we con- 
sider a neighbourhood of the true distribution defined by the topology cho- 
sen according to fulfils the conditions of Theorem 11.21 Write V(/-ts) this neigh- 
bourhood, then using compacity there exist a finite sequence of neighbourhood 
V(fik), k — 1, . . . , rjv such that 

f s -%)cu? =1 %). 
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Hence, finding the rate of convergence of nonparametric maximum likelihood es- 
timator implies studying the deviation probability 

r N 



for < 7 < 1 as it is quoted in [21] . Bounding this deviation inequality requires 
two main ingredients. First a bound for the entropy of the mixture class. Recent 
works by van der Vaart, see for instance JTU] and [12! , give upper bounds for the 
entropy of such classes and hence provide a control over r N . Second, to conclude, 
there is a need for a deviation inequality over the previous empirical process. 
Unfortunately, to our concern, concentration bounds in this framework are very 
difficult to obtain, preventing further calculations to obtain rates of convergence. 
Work in this direction was conducted by van de Geer in [21] but can not be applied 
in this framework. Thus, it seems rather difficult to obtain rates of convergence 
for nonparametric maximum likelihood estimator using this settings. 

3.5 No sieves 

In order to construct a practical maximum likelihood estimator, one needs 
to construct a family of finite dimensional spaces undergoing the assumptions of 
Theorem (|2.1|) . Two main choices are investigated in the statistical literature, but 
none fulfils all the needed requirements. 

On the one hand, we could consider sieves constructed on log bases. Indeed, 
for a basis {i/j\)\ga of an Hilbert space, consider for a fixed integer m the set 



where A m C A with |A m | ^ m. If we have taken spline basis for our initial choice 
of we get the traditional log-spline model, well studied by Stone in j2EJ- Such 
sets are made of densities but are not compact for the chosen topology. 

On the other hand consider a Multiresolution analysis, see for instance [TSj . 
constructed using a wavelet basis, (Ca)aga- Hence the finite dimensional sets corre- 
sponding to the approximation spaces are defined by Ts, m = {f = J2\eA /^aCa}- 
Notice that !Fs,m is a closed convex subset of an Hilbert space. However, it is not 
a subset of set of the densities. This drawback appears frequently when esti- 
mating densities by wavelet estimators: the estimate is not a density. This defect, 
which in standard issues is not redhibitory, prevents here the use of Theorem (j2.1j) . 



P(f^~ N i V{ns)) < e V(ji k )) 



k=l 
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Conclusion 

We have shown that the nonparametric maximum likelihood estimator for (JIJ 
is consistent. However, the practical construction of usable sieves in the spirit of 
Section El is questionable. Improvements and rates of convergence are difficult to 
obtain in these setting. In the case where a large number of observations for each 
subject are available, i.e. n — > +00, the problem can be divided in two sub-issues: 
first estimate the random effect and then build a nonparametric estimator of its 
density. This point of view is tackled for example in [3] or However, when 
there is no hope for more data, in particular when dealing with medical data for 
which typically n is less than 5, we believe that other types of estimators should 
be considered. 
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